- Create a PDF document and a website to communicate your own analysis
- Including your own text, analysis, table and chart
- Using data from an external source (a CSV file)
- Understand the data science workflow in R
- Gain confidence in R
1 - Use RStudio for everything. R is the engine; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.
2 - Our work is in the form of a 'Recipe Book': A step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) and the outputs (the picture of the perfect meal).
3 - Use R Markdown (.Rmd) files to combine analysis and outputs: R Markdown files allow us to combine data processing in R, in clearly defined 'chunks', with the display of text, tables, charts and maps.
4 - Output is produced only when we press 'Knit': It is NOT an interactive playground like Excel (though individual lines can be run interactively with Ctrl+Enter).
5 - Our work should be self-explanatory and reproducible: Anybody with R should be able to open our work, press 'Knit' and produce the same outputs.
6 - Organize your work in RStudio Projects: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' in RStudio. Save all your data inputs and outputs in this folder (RStudio sets it as the working directory automatically).
7 - Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.
8 - Use meaningful names in your work: 'data_v1b_2_061215' won't mean anything in 3 months! All files and objects should reflect their role in the analysis.
9 - Process our data in a 'tidy' way: This means we will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.
$$ a^2 + b^2 = c^2 $$
new_object <- old_object
data_frame %>% action_on_dataframe
# Comments go here and won't be processed by R
install.packages("New_package")  # run ONCE, then
library("New_package")           # at the start of each document
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)
answer <- inputs*10
answer
## [1] 0 2 4 6 8 10
Set the title, date, author, etc. in the YAML header at the top of the R Markdown file.
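As a rough illustration of what such a header looks like, the sketch below writes a minimal R Markdown file from R. The title, author, date and file name are placeholders invented for this example, and the code fence is built with `strrep()` only to keep the example readable.

```r
# Build a minimal R Markdown document as a character vector.
# All header values below are placeholders, not taken from the course.
fence <- strrep("`", 3)  # the three backticks that open and close a chunk

rmd_lines <- c(
  "---",
  'title: "My Analysis"',
  'author: "Your Name"',
  'date: "2020-01-01"',
  "output: html_document",
  "---",
  "",
  "Some explanatory text.",
  "",
  paste0(fence, "{r}"),
  "answer <- 2 + 2",
  "answer",
  fence
)

# Save it to a temporary folder (hypothetical file name).
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Pressing 'Knit' in RStudio is roughly equivalent to:
# rmarkdown::render(rmd_path)
```

The header sits between the two `---` lines; everything after it is ordinary text and code chunks.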
- read_csv
- filter, select, mutate
- left_join
- mutate, summarize
- zelig
- kable, stargazer
- ggplot
- leaflet, mapview

data <- read_csv("data.csv")
library(foreign)
data <- read.spss("data.sav")
data <- read.dta("data.dta")
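To try the import step without an external file, the following sketch writes a small CSV to a temporary location and reads it back with `read_csv`. The object name `flights_sample` and the five rows (copied from the example tables in these slides) are assumptions made for illustration.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,air_time,distance,dep_delay",
  "UA,EWR,227,1400,2",
  "UA,LGA,227,1416,4",
  "AA,JFK,160,1089,2",
  "B6,JFK,183,1576,-1",
  "DL,LGA,116,762,-6"
), csv_path)

# col_types = "ccddd": two character columns followed by three numeric ones.
# Stating the types explicitly makes the import reproducible.
flights_sample <- read_csv(csv_path, col_types = "ccddd")

flights_sample
```

In a real project the path would point to a file inside the project folder, for example `read_csv("data.csv")`.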
flights %>%
group_by(carrier) %>%
summarize(dep_delay=mean(dep_delay)) %>%
kable()
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay)) %>%
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))
flights %>%
group_by(carrier) %>%
summarize(dep_delay=mean(dep_delay)) %>%
zelig(dep_delay ~ carrier,data=.,model="ls") %>%
stargazer(digits=3)
- select specific variables (columns)
- slice observations (rows)
- filter observations (rows) by conditions (based on values in columns)
- count number of observations
- rename variables (columns)
- arrange table in the order of a particular variable
- mutate (change) values of an existing variable or create a new variable
- summarize data by creating statistics
- round values to a specific number of decimal places

flights %>% select(air_time)
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116
flights %>% slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400         2
## 2 UA      LGA         227     1416         4
flights %>% filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089         2
## 2 B6      JFK         183     1576        -1
flights %>% mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR       3.783     1400         2
## 2 UA      LGA       3.783     1416         4
## 3 AA      JFK       2.667     1089         2
## 4 B6      JFK       3.050     1576        -1
## 5 DL      LGA       1.933      762        -6
flights %>% mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400         2 6.167
## 2 UA      LGA         227     1416         4 6.238
## 3 AA      JFK         160     1089         2 6.806
## 4 B6      JFK         183     1576        -1 8.612
## 5 DL      LGA         116      762        -6 6.569
flights %>% summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1       1248.6
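The verbs `count`, `rename` and `arrange` appear in the list but are not demonstrated here. A minimal sketch follows, assuming a five-row sample called `flights_sample` (a hypothetical name; the rows are copied from the example tables in these slides). The new column name `delay_minutes` is also invented for illustration.

```r
library(dplyr)

# Five-row sample mirroring the example tables (hypothetical object name).
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# count: number of observations per origin airport (result column is `n`).
by_origin <- flights_sample %>% count(origin)

# rename: new_name = old_name.
renamed <- flights_sample %>% rename(delay_minutes = dep_delay)

# arrange: sort rows by distance, longest flight first.
sorted <- flights_sample %>% arrange(desc(distance))

by_origin
renamed
sorted
```

Each verb returns a new table and leaves `flights_sample` unchanged, which is why the results are stored under their own meaningful names.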
These actions can be 'piped' together:
We want to find the average speed of United (UA) flights.
In steps: take the data, filter the data to carrier UA, calculate the speed of each flight, and then find the average.
flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>%
  as.numeric()
## [1] 420.9
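To see what the pipe does, it can help to compare a piped version with the same calculation written as nested function calls. The sketch below assumes a five-row sample named `flights_sample` (hypothetical; rows copied from the example tables), so its result differs from the 420.9 reported for the full data.

```r
library(dplyr)

# Five-row sample mirroring the example tables (hypothetical object name).
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Piped: read top to bottom, one step per line.
piped <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Nested: the same steps, but read from the inside out.
nested <- summarize(
  mutate(
    filter(flights_sample, carrier == "UA"),
    speed = distance / (air_time / 60)
  ),
  avg_speed = mean(speed, na.rm = TRUE)
)

piped
nested
```

Both versions return the same one-row table; the pipe simply passes the result of each step as the first argument of the next.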
avg_speed <- flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

The average speed of United flights is `r avg_speed` miles per hour.

The average speed of United flights is 420.9 miles per hour.
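Inline code works most predictably with a single number rather than a one-cell table. One possible approach, sketched here on an assumed five-row sample (`flights_sample` and `avg_speed_value` are hypothetical names), is to extract the column with `pull()` before rounding.

```r
library(dplyr)

# Five-row sample mirroring the example tables (hypothetical object name).
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# pull() turns the one-column summary into a plain numeric value.
avg_speed_value <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  pull(avg_speed) %>%
  round(1)

# Roughly what the knitted sentence would contain for this sample.
sentence <- paste0(
  "The average speed of United flights is ",
  avg_speed_value,
  " miles per hour."
)
sentence
```

Because the sample has only two United flights, the number in the sentence is not the 420.9 obtained from the full data.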
Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, filter the data to carrier UA, and then find the average.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  filter(carrier=="UA") %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)
420.9
Yes: the result is still 420.9, because each flight's speed is calculated row by row and does not depend on which other rows are kept.
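Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, find the average, and then filter the data to carrier UA.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  filter(carrier=="UA") %>%
  round(1)
394.3
No: this order does not answer the question. summarize() collapses the table to a single row holding only avg_speed, so by the time filter() runs there is no carrier column left to test, and the step typically stops with an error. The 394.3 shown is not the United average of 420.9.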
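Summarizing before filtering can only work if the carrier column survives the summary. One way to achieve that, sketched below on an assumed five-row sample (`flights_sample` is a hypothetical name; rows copied from the example tables), is to add `group_by(carrier)` before `summarize`.

```r
library(dplyr)

# Five-row sample mirroring the example tables (hypothetical object name).
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

with_speed <- flights_sample %>%
  mutate(speed = distance / (air_time / 60))

# Ungrouped summary: one row, and the carrier column is gone.
overall <- with_speed %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Grouped summary: one row per carrier, so filtering afterwards is possible.
by_carrier <- with_speed %>%
  group_by(carrier) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

ua_grouped <- by_carrier %>% filter(carrier == "UA")

# Reference: filter first, then summarize.
ua_filtered_first <- with_speed %>%
  filter(carrier == "UA") %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

names(overall)
ua_grouped
ua_filtered_first
```

With grouping, both orders give the same United average; without it, the summary mixes all carriers together.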
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable()
| carrier | origin | air_time | distance | dep_delay |
|---|---|---|---|---|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable(caption="Example Table", align="lcccc")
| carrier | origin | air_time | distance | dep_delay |
|:---|:---:|:---:|:---:|:---:|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>%
  filter(carrier=="UA") %>%
  ggplot() +
  geom_point(aes(x=dep_time,y=dep_delay))

flights %>%
  filter(carrier=="UA") %>%
  ggplot() +
  geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay))

flights %>%
  filter(carrier=="UA") %>%
  ggplot() +
  geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")
flights %>% ggplot() + geom_bar(aes(x=dep_delay))
flights %>% ggplot() + geom_bar(aes(x=dep_delay)) + xlim(-30,100)
flights %>%
  group_by(origin) %>%
  summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() +
  geom_col(aes(x=origin, y=avg_delay))
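The same bar chart can be stored as an object and labelled in one step with `labs()`, an alternative to separate `ggtitle`, `xlab` and `ylab` calls. The sketch assumes a five-row sample (`flights_sample`, a hypothetical name; rows copied from the example tables), and the label text and file name are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Five-row sample mirroring the example tables (hypothetical object name).
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Step 1: summarize to one row per origin airport.
avg_delay <- flights_sample %>%
  group_by(origin) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

# Step 2: build the chart and keep it in a named object.
delay_chart <- ggplot(avg_delay) +
  geom_col(aes(x = origin, y = avg_delay)) +
  labs(
    title = "Average Departure Delay by Origin",
    x = "Origin Airport",
    y = "Average Departure Delay (minutes)"
  )

delay_chart

# Optionally save the chart to a file (hypothetical file name):
# ggsave(file.path(tempdir(), "avg_delay.png"), delay_chart, width = 6, height = 4)
```

Keeping the summary table and the chart as separate named objects makes it easy to reuse the same numbers in a table with `kable()`.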